The purpose of this document is to detail the analysis of CNVs over windows across the genome. The approach is similar to that detailed in custom_cnv_v2.Rmd in that it uses a z-score (i.e. standard deviations away from the mean) of the difference in read-depth between the sample and a reference genome (in this case B73). As such, it borrows heavily from the code employed in custom_cnv_v2.Rmd, with some slight modifications.
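As a rough sketch of that logic (this is not the actual callCNV.R implementation sourced later in the document; the depths and the z-score cutoff below are hypothetical), the calling step amounts to:

```python
# Illustrative z-score CNV caller. NOT the callCNV.R implementation used
# later in this document; depths and the cutoff here are hypothetical.
import statistics

def call_cnv(sample_depths, ref_depths, z_cutoff=2.0):
    """Call each window "up"/"down"/"normal" from the sample-minus-reference
    read-depth difference, expressed in standard deviations from the mean."""
    diffs = [s - r for s, r in zip(sample_depths, ref_depths)]
    mu = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    calls = []
    for d in diffs:
        z = (d - mu) / sd
        if z > z_cutoff:
            calls.append("up")      # higher copy number than the reference
        elif z < -z_cutoff:
            calls.append("down")    # lower copy number than the reference
        else:
            calls.append("normal")
    return calls
```

The real analysis works on GC-normalized per-window counts, but the thresholding idea is the same: a window whose depth difference sits far from the genome-wide mean is flagged as a putative CNV.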

Generating the coverage data

The coverage information is calculated using .bam files generated by Vince Buffalo and the following script:

#!/bin/bash -l
#SBATCH -D /group/jrigrp4/CNVsim/data/windows
#SBATCH -o /group/jrigrp4/custom_cnv/logs/mcov_out_log-%j.txt
#SBATCH -e /group/jrigrp4/custom_cnv/logs/mcov_err_log-%j.txt
#SBATCH -J SimMulticov
#SBATCH --array=0-4
#SBATCH --mem-per-cpu=12000
#SBATCH --cpus-per-task=1

##Simon Renny-Byfield, UC Davis, June 2015

# define the windows of interest
windows=(50 250 500 1000 2000)

echo "Job Starting: "
date
echo "Working on window size ${windows[$SLURM_ARRAY_TASK_ID]}"

cmd="bedtools multicov -bams /group/jrigrp4/CNVsim/data/windows/*sorted.bam -bed <(bedtools makewindows -g ../../../custom_cnv/data/chrLenFilev2.txt -w ${windows[$SLURM_ARRAY_TASK_ID]}) -q 30 > window_${windows[$SLURM_ARRAY_TASK_ID]}bp_coverage.txt"
echo $cmd
eval $cmd

echo "Job Done: "
date

Next, the data have to be GC normalized using a script that looks like this:

#!/bin/bash -l
#SBATCH -D /group/jrigrp4/custom_cnv/data
#SBATCH -o /group/jrigrp4/custom_cnv/logs/gc_out_log-%j.txt
#SBATCH -e /group/jrigrp4/custom_cnv/logs/gc_err_log-%j.txt
#SBATCH -J gc
#SBATCH --array=0-2
#SBATCH --mem-per-cpu=20000
#SBATCH --cpus-per-task=1

##Simon Renny-Byfield, UC Davis, April 2015
 

windows=(500 1000 2000)

echo "Starting Job:"
date

cmd="Rscript ../scripts/gcNorm_sim.R window_${windows[$SLURM_ARRAY_TASK_ID]}bp_coverage.txt window_${windows[$SLURM_ARRAY_TASK_ID]}bp_GC.txt gc_normalized_${windows[$SLURM_ARRAY_TASK_ID]}bp_median.RData"
echo $cmd
eval $cmd

echo "Job done: "
date

This script calls an R script that utilizes the EDASeq package and looks like this:

###
# Normalize read-depth over sliding windows across the Maize genome
###

# Simon Renny-Byfield, June '15, UC Davis

options(echo=TRUE) # echo the commands in the output file
args <- commandArgs(trailingOnly = TRUE)
print(args)

# load the required packages
library(EDASeq)     # for withinLaneNormalization()
library(data.table)

# load in the data
chr6cov<-read.table(args[1])

norm.df<-subset(chr6cov, select=-c(1:3))
# normalize by column total
norm.df<-sweep(norm.df, 2, colSums(norm.df), FUN="/")
norm.df<-sweep(norm.df, 2, 1e6, FUN="*")

#load in the gc data
ch6_gc<-read.table(args[2])
#attach it to norm.df

gc<-ch6_gc$V5
norm.df<-data.matrix(norm.df)
# now try to normalize the data according to GC content
chr6Coverage<-withinLaneNormalization(x=norm.df, y=gc, which="median", round=FALSE)
# make a new data.table of the GC-normalized counts
chr6Coverage<-data.table(chr6Coverage)
chr6Coverage<-cbind(chr6cov[,1:3],chr6Coverage,gc)

save(chr6Coverage,file=args[3])

Generating Tajima’s D over the same windows

Next we need to generate the Tajima’s D data. This is done using ANGSD as follows:

#!/bin/bash -l
#SBATCH -D /group/jrigrp4/custom_cnv/sfs/windows
#SBATCH -o /group/jrigrp4/custom_cnv/logs/sfs_out_log-%j.txt
#SBATCH -e /group/jrigrp4/custom_cnv/logs/sfs_err_log-%j.txt
#SBATCH -J SFS&TD
#SBATCH --mem-per-cpu=10000
#SBATCH --cpus-per-task=18

##Simon Renny-Byfield, UC Davis, June 2015

echo "Starting Job:"
date

# calculate the .saf file
cmd="angsd -bam ../../../teosinte_parents/file.list.txt -doSaf 2 -out teosinte19 -anc ../../../teosinte_parents/genomes/TRIP.fa -ref ../../../teosinte_parents/genomes/Zea_mays.AGPv3.22.dna.genome.fa -GL 1 -P 18 -indF ../../../teosinte_parents/angsd_outpu$
echo $cmd
#eval $cmd

# formally calculate the SFS
cmd="realSFS teosinte19.saf 38 -maxIter 100 -P 12 > teosinte19.sfs"
echo $cmd
#eval $cmd

# now figure out region-wide thetas
cmd="angsd -bam ../../../teosinte_parents/file.list.txt -out teosinte19_thetas.sfs -doThetas 1 -doSaf 2 -ref ../../../teosinte_parents/genomes/Zea_mays.AGPv3.22.dna.genome.fa -indF ../../../teosinte_parents/angsd_output/teo_parents19e-6.indF -pest teosin$
echo $cmd
#eval $cmd

# make the .bed file
cmd="thetaStat make_bed teosinte19_thetas.sfs.thetas.gz"
echo $cmd
eval $cmd

#calculate Tajimas D
cmd="thetaStat do_stat teosinte19_thetas.sfs.thetas.gz -nChr 19 -win 1000 -step 1000 -outnames teosinte19.teothetasWindow.gz"
echo $cmd
eval $cmd

#now clean up the output..
cmd="cat teosinte19.teothetasWindow.gz.pestPG | awk '\$14 != \"0\" {print}' > teosinte19.TajD.txt"
echo $cmd
#eval $cmd

echo "End Job: "
date

To calculate Tajima’s D over windows of varying sizes we can use a batch script that looks something like this:

#!/bin/bash -l
#SBATCH -D /group/jrigrp4/custom_cnv/sfs/windows
#SBATCH -o /group/jrigrp4/custom_cnv/logs/sfs_out_log-%j.txt
#SBATCH -e /group/jrigrp4/custom_cnv/logs/sfs_err_log-%j.txt
#SBATCH -J SFS&TD
#SBATCH --array=0-4
#SBATCH --mem-per-cpu=5000
#SBATCH --cpus-per-task=18

##Simon Renny-Byfield, UC Davis, June 2015

windows=(2000 1000 500 250 50)

winSize=${windows[$SLURM_ARRAY_TASK_ID]}

echo "Starting Job:"
date

#calculate Tajimas D
cmd="thetaStat do_stat teosinte19_thetas.sfs.thetas.gz -nChr 19 -win $winSize -step $winSize -outnames teosinte19.teothetasWindow$winSize.gz"
echo $cmd
eval $cmd

echo "Ending Job: "
date

The data then need to be modified using a custom perl script, angsd2GenomicRanges.pl, that converts the output of ANGSD into something GenomicRanges can understand.

#!/usr/bin/perl
use strict;
use warnings;

# Simon Renny-Byfield, UC Davis, March 17th 2015

#usage script.pl <angsd output> >outfile

# This script will take angsd output (.pestPG file) and convert format suitable 
# for input into GenomicRanges. The purpose is to associate regions in maize genome
# that have estimates of Tajima's D and relate those estimates to particular genes. 
# In short each window from the angsd analysis will be labelled with the corresponding 
# gene name using GenomicRanges. 

# The output from angsd looks like this..

# ## thetaStat VERSION: 0.01 build:(Oct 23 2014,15:39:05)
# #(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop)  Chr    WinCenter   tW  tP  tF  tH  tL  Tajima  fuf fud fayh    zeng    nSites
# (25,53)(1778322,1778350)(1778250,1778350) 1   1778300 1.661604    0.550991    3.218805    1.968841    1.259916    -2.122087   -1.758895   -1.220118   -0.736941   -0.208055   28
# (25,96)(1778322,1778532)(1778300,1778400) 1   1778350 3.663915    2.025260    7.192596    3.483926    2.754594    -1.622298   -1.918688   -1.580332   -0.370203   -0.240505   71
# (53,96)(1778350,1778532)(1778350,1778450) 1   1778400 2.002310    1.474270    3.973791    1.515086    1.494677    -0.869529   -1.449146   -1.362891   -0.017994   -0.225741   43
# (96,127)(1778532,1778731)(1778500,1778600)    1   1778550 0.235098    0.097436    0.000001    0.005128    0.051282    -0.980203   0.205032    0.575739    0.213366    -0.367884   31
# (114,127)(1778550,1778731)(1778550,1778650)   1   1778600 0.235098    0.097436    0.000001    0.005128    0.051282    -0.980203   0.205032    0.575739    0.213366    -0.367884   13
# (127,146)(1778731,1778750)(1778650,1778750)   1   1778700 0.000504    0.000107    0.002142    0.000003    0.000055    -0.066506   -0.096080   -0.089562   0.006211    -0.021466   19

#but a suitable input into GenomicRanges might look like this

# ## thetaStat VERSION: 0.01 build:(Oct 23 2014,15:39:05)
# #(indexStart,indexStop)(firstPos_withData,lastPos_withData)(WinStart,WinStop) Chr WinCenter   tW  tP  tF  tH  tL  Tajima  fuf fud fayh    zeng    nSites
# 1 1778250 1778350 1778300 1.661604    0.550991    3.218805    1.968841    1.259916    -2.122087   -1.758895   -1.220118   -0.736941   -0.208055   28

#where the first three columns are chr, start, stop.

####
# Read in the data
####

# open the input file

open ( IN , "<$ARGV[0]" ) || die "could not open file:$!\n";

####
# Cycle through the file modifying each line
####

while ( <IN> )  {
    if ( m/##/ ) {
        print;
        next
    }
    if ( m/#\(/ ) {
        print "chr  start   stop    WinCenter   tW  tP  tF  tH  tL  Tajima  fuf fud fayh    zeng    nSites\n";
        next
    }#if
    chomp;
    my @data = split "\t";
    # grab the start and stop of each window
    $data[0] =~ m/\(\d+,\d+\)\(\d+,\d+\)\((\d+,\d+)\)/;
    my $new = $1;
    #replace a comma with tab
    $new =~ s/,/\t/;
    #print the new line as you want it.
    print $data[1] , "\t" , $new , "\t" , join( "\t" , @data[2 .. $#data]) , "\n";
}#while

exit;

Unifying the coverage and Tajima’s D datasets

Unfortunately not all windows are present in the resulting output (I think this is because not all windows have data), so we need a few additional steps to make sure we keep the windows in order. To do this we can take advantage of the GenomicRanges package in R.

First load in some libraries; the list can be expanded as we need them.

library(data.table)
library(GenomicRanges)
library(rtracklayer)
library(scales)
library(ggplot2)

# now set the working dir
setwd("/Users/simonrenny-byfield/GitHubRepos/cnvwin")

Then we can unify the data.

# load in the TD data
td.data<-fread("/Users/simonrenny-byfield/GitHubRepos/cnvwin/data/TD_windows_2000.txt")
## 
Read 1029793 rows and 15 (of 15) columns from 0.112 GB file in 00:00:05
# turn into a GRanges object
td.ranges<-GRanges(seqnames=td.data$chr, ranges=IRanges(td.data$start,td.data$stop),td=td.data$Tajima)

# load in the coverage data
load("/Users/simonrenny-byfield/GitHubRepos/cnvwin/data/gc_normalized_2000bp_median.RData")
gcNorm<-chr6Coverage
# turn into a GRanges object
gcNorm<-GRanges(seqnames=gcNorm$V1, ranges=IRanges(gcNorm$V2,gcNorm$V3))

###
# Check for overlaps and unify the data
###

# first make a placeholder column to modify, filled with NA values
values(gcNorm)$td<-NA
# now check for overlaps, require "equal" because we want complete overlap
overlaps<-as.matrix(findOverlaps(gcNorm,td.ranges, type="equal"))
# use the hit matrix to add a value of TD to the coverage data set
values(gcNorm)$td[overlaps[,1]]<-td.ranges$td[overlaps[,2]]

# now turn it back into a data.table
gcNorm<-data.table(chr=as.vector(seqnames(gcNorm)),start=start(gcNorm),end=end(gcNorm),td=gcNorm$td)
#remove the positional info and the gc info
gcNorm<-cbind(gcNorm,chr6Coverage[-c(1:3,26)])

# do some clearing up
chr6Coverage<-NULL
gc()
##            used  (Mb) gc trigger  (Mb) max used  (Mb)
## Ncells  3192879 170.6    4679654 250.0  3933533 210.1
## Vcells 43198487 329.6   99135496 756.4 94215127 718.9

Load in the gene annotation data

Now that we have unified the coverage and Tajima’s D datasets, we can begin to load in the gene annotation data. Included with this will be the necessary maize1 and maize2 data.

# load in the annotation data
gff<-fread("/Users/simonrenny-byfield/GitHubRepos/cnvwin/data/AGPv3.22_primary_transcripts.bed")
gff<-GRanges(seqnames=gff$V1, ranges=IRanges(gff$V2,gff$V3),type=gff$V4,name=gff$V5)
# remove scaffolds
gff<-subset(gff,subset=seqnames %in% c(1:10))

# prepare the GRanges object to receive the sub-genome info
values(gff)$sub.genome<-"no syn"
# load in the maize1 maize 2 subgenomes info
subGenomes<-read.csv("/Users/simonrenny-byfield/GitHubRepos/custom_cnv/data/gene_by_sugenome.csv",na.strings=c("",".","NA"))
# attribute the correct sub-genome to each annotation
values(gff)$sub.genome[gff$name %in% subGenomes[,2]]<-"M1"
values(gff)$sub.genome[gff$name %in% subGenomes[,3]]<-"M2"

Add sample names to the dataset

samples<-read.table("/Users/simonrenny-byfield/GitHubRepos/cnvwin/data/sample_list.txt")
samples<-as.vector(gsub(".sorted.bam","",samples$V1))
setnames(gcNorm,colnames(gcNorm),c(colnames(gcNorm)[1:4],samples))
samples<-samples[samples != "JRIAL2T"]

Filtering the data

Before we start, it is a good idea to screen out windows that have very high coverage in B73; these are likely to represent repetitive or poorly assembled regions of the genome and should be excluded from further consideration.

#set the max read depth
maxDepth<-10
minDepth<-0.01
gcNorm<-subset(gcNorm,subset=gcNorm$B73 < maxDepth )
gcNorm<-subset(gcNorm, subset= gcNorm$B73 > minDepth)

Running the CNV calling function

source("/Users/simonrenny-byfield/GitHubRepos/cnvwin/scripts/callCNV.R")
cnvs<-callCNV(gcNorm,samples,limit=0.7,limitHom=3.2)
#dev.off()
# take a look at the genic region in terms of coverage
cov<-GRanges(seqnames=gcNorm$chr, ranges=IRanges(gcNorm$start,gcNorm$end),coverage=subset(gcNorm,select="B73"))
# check for overlaps between gcNorm and gff
geneData<-subsetByOverlaps(cov,gff,type = "within")

Next, the best thing to do is to calculate the allele frequencies based on the genotype calls. We can do this using an apply function over the gcNorm table, with sum as the function. Later we can trim to the CNVs we want by looking at only those with exclusively up, or down, calls etc. At this stage:

# re-attribute the gcNorm data frame
gcNorm<-cnvs$cnvs
# remove the strange sample 
gcNorm<-gcNorm[,JRIAL2T:=NULL]
gcNorm<-gcNorm[,JRIAL2T_CNV:=NULL]
freqs<-apply(subset(gcNorm,select=c(27:45)), 1 , function(x) sum(x))
# add a new column to the gcNorm table, note this is the frequency of the DNA segment
gcNorm<-gcNorm[,freq:=freqs]

Examining “Down” CNVs: SFS vs SNPs

We probably want to focus on only those regions of the genome that have been called as “down” copy-number variants. These are regions that appear to be in lower copy number in one or more of the Palmar Chico individuals when compared with the reference B73.

#now try to look at the 'down' CNV
downGroup<-gcNorm[apply(subset(gcNorm,select=c(27:45)),1,function(x) !all(x == 2) & !any(x==3) ),]
# calculate the sfs for deletion
propVariantSites<-rev(table(downGroup$freq)/sum(table(downGroup$freq)))
propVariantSites<-propVariantSites[-38]
# do it for a folded SFSs
#min(p,38-p)

#load in the snp sfs data
SNPsfs<-exp(scan("/Users/simonrenny-byfield/GitHubRepos/cnvwin/data/teosinte19.sfs"))
SNPsfs<-SNPsfs[-c(1,length(SNPsfs))]# removed fixed classes
SNPsfs<-(SNPsfs/sum(SNPsfs))

# plot the SFSs
b<-barplot(rbind(SNPsfs,c(propVariantSites)),beside=TRUE, col = c("cornflowerblue","orange"), xlab="frequency",
          ylab="proportion of variant sites", xaxt="n", cex.axis=1.4, cex.lab=1.4, main="unfolded SFS")
Axlabel<-c(1,5,10,15,20,25,30,35)
AxAt<-(b[1,]+b[2,])/2
AxAt<-AxAt[seq(1,length(AxAt),by=5)]
axis(1,at=AxAt,labels=Axlabel,cex.axis=1.4)
legend("topright", legend=c("SNPs","down CNVs"), pch=15, col=c("cornflowerblue","orange"), cex=1.5, bty="n")

Because we cannot polarize the CNVs between B73 and Palmar Chico, it is best to draw a folded SFS.

# calculate the folded SFSs for down CNVs
alleleDiff<-38-downGroup$freq
d<-cbind(alleleDiff,downGroup$freq)
MAF<-apply(d,1,min)
foldedPropVariantsSites<-table(MAF)/sum(table(MAF))
foldedPropVariantsSites<-foldedPropVariantsSites[-c(1,length(foldedPropVariantsSites))]
# load in the folded SFS for SNPs
foldsfs<-exp(scan("/Users/simonrenny-byfield/GitHubRepos/cnvwin/data/folded.teosinte19.sfs"))
foldsfs<-foldsfs[-c(1,length(foldsfs))]# removed fixed classes
foldsfs<-(foldsfs/sum(foldsfs))

b<-barplot(rbind(foldsfs,foldedPropVariantsSites),beside=TRUE, col = c("cornflowerblue","orange"), xlab="frequency",
          ylab="proportion of variant sites", cex.axis=1.6, cex.lab=1.4, main="folded SFS")
Axlabel<-c(1,5,10,15)
AxAt<-(b[1,]+b[2,])/2
AxAt<-AxAt[seq(1,length(AxAt),by=5)]
axis(1,at=AxAt,labels=Axlabel,cex.axis=1.6)
legend("topright", legend=c("SNPs","down CNVs"), pch=15, col=c("cornflowerblue","orange"), cex=1.5, bty="n")

Are CNVs in Hardy-Weinberg Equilibrium?

The next step is to analyse CNVs with respect to HWE. Do genotypes end up in HWE given our data? If so, this gives us added confidence that the CNVs are biologically real (as they follow expected patterns of segregation) and, secondly, that they are likely to be segregating neutrally. In this instance we will focus on the “down” CNVs; the HWE expectation is clear in this case, and most of our further analyses will be on this category of CNV.

# first calculate the frequency (proportion) of the CNV, and the "normal" DNA.
DNA_prop<-downGroup$freq/38
downGroup<-downGroup[,DNA_prop:=DNA_prop]
# calculate the expected genotype proportions according to HWE
homDNA<-DNA_prop^2  #this corresponds to genotype call 2 p^2
homCNV<-(1-DNA_prop)^2# this corresponds to genotype call 0, q^2
hetero<-(2*(1-DNA_prop)*(DNA_prop))#  this corresponds to genotype call 1, 2pq

# now what about the observed proportions
obs.dt<-t(apply(subset(downGroup,select=c(27:45)), 1 , function(x) (table(factor(x, levels=0:2)))))

# bind these vectors together
obs.dt<-as.data.frame(obs.dt)
colnames(obs.dt)<-c("obs.homoCNV","obs.hetero","obs.homoDNA")

# now assemble the expected genotypes frequencies
# the proportion will be multiplied by 19 so they are directly comparable to the
# observed genotype calls.

exp.dt<-cbind("exp.homoCNV"=homCNV,"exp.hetero"=hetero,"exp.homoDNA"=homDNA)*19
#myList<-list("o"=obs.dt,"e"=exp.dt*19)

#pvals<-apply(cbind(obs.dt,exp.dt), 1 , function(x) { fisher.test(rbind(x[1:3],x[4:6]))$p.value})

#hist(pvals,col="darkgrey",main="fisher's exact test", xlab="p-value", cex.axis=1.3,cex.lab=1.3)
# add in the observed, expected and p-value
downGroup<-cbind(downGroup,exp.dt,obs.dt)
#<-downGroup[,"hwe.pvalue":=pvals]
downGroup<-downGroup[,"prop.DNA.homozygotes":=(downGroup$obs.homoDNA/19)]
downGroup<-downGroup[,"prop.CNV.homozygotes":=(downGroup$obs.homoCNV/19)]
downGroup<-downGroup[,"prop.heterozygotes":=(downGroup$obs.hetero/19)]

ggplot(data=downGroup,aes(x=freq, y=prop.DNA.homozygotes)) +
  theme_bw()+
  geom_smooth(size=2)+
  geom_smooth(data=downGroup,aes(x=freq, y=prop.CNV.homozygotes), colour="red", size=2)+
  geom_smooth(data=downGroup,aes(x=freq, y=prop.heterozygotes), colour="green", size=2)
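As a quick sanity check on the HWE expectations used above (19 diploid individuals, so 38 alleles; the allele frequency below is made up for illustration):

```python
# Worked check of the HWE expected genotype counts used above.
# 19 diploid individuals (38 alleles); the allele frequency is hypothetical.
p = 30 / 38                  # frequency of the "normal" DNA allele
q = 1 - p                    # frequency of the CNV (deletion) allele
n_ind = 19

exp_homDNA = p**2 * n_ind    # expected count of genotype 2 (p^2)
exp_hetero = 2*p*q * n_ind   # expected count of genotype 1 (2pq)
exp_homCNV = q**2 * n_ind    # expected count of genotype 0 (q^2)

# the three genotype classes must sum to the number of individuals
assert abs(exp_homDNA + exp_hetero + exp_homCNV - n_ind) < 1e-9
```

Multiplying the expected proportions by 19 makes them directly comparable to the observed genotype counts, which is exactly what exp.dt does above.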

Where are CNVs across the genome?

source("/Users/simonrenny-byfield/GitHubRepos/cnvwin/scripts/data.frame2GRanges.R")
downGRanges<-data.frame2GRanges(as.data.frame(downGroup),keepColumns = TRUE)

#load in the MB windows...
MB.windows<-read.table("/Users/simonrenny-byfield/GitHubRepos/custom_cnv/data/MB.intervals.bed")

# turn into a GRanges object
MB.windows<-GRanges(seqnames=MB.windows$V1,ranges=IRanges(MB.windows$V2,MB.windows$V3), "middle"=(MB.windows$V2+MB.windows$V3)/2 )

# find the proportion of windows that are called as CNV
MB.hits<-as.matrix(findOverlaps(MB.windows,downGRanges))
propCNV<-table(MB.hits[,1])/500

# now attribute the propCNV to the MB.windows
MB.windows$propCNV<-0
MB.windows$propCNV[as.numeric(names(table(MB.hits[,1])))]<-propCNV


#save(MB.windows,file="MB.windows.RData")
# attribute the local (per-Mb) proportion of CNVs to each down CNV call
values(downGRanges)$propCNV<-0
mb.down<-as.matrix(findOverlaps(downGRanges,MB.windows))
# now add the appropriate value of propCNV
values(downGRanges)$propCNV[mb.down[,1]]<-MB.windows$propCNV[mb.down[,2]]

# now plot the data

deletionPlot<-ggplot(data=as.data.frame(downGRanges)[as.data.frame(downGRanges)$seqnames %in% c(3,5,6,7),], aes(x=start/10^6,y=1-(freq/38),colour=propCNV)) +
  theme_bw()+
  geom_point(size= 2)+
  #geom_smooth(data=as.data.frame(downCNVs),x=start,y=freq,color = "red")
  facet_grid(seqnames ~ .)+
  xlab("position (Mbp)") +
  ylab("CNV frequency") +
  scale_y_continuous(breaks=c(0,0.5,1))+
  scale_colour_gradientn(colours=rainbow(5),name="local\ndensity\nof CNVs") +
  guides(fill=guide_legend(title="proportion of CNVs")) +
  #geom_smooth(data = as.data.frame(chr6Coverage), aes(x=start/10^6,y=JRIAL2D/1000, col = chr))+
  theme(axis.title.x = element_text(size=a.size,vjust=-0.3),axis.title.y = element_text(size=a.size,vjust=1.6,hjust=0.75),
        axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-15),
        strip.text.y=element_text(size=25,angle=-90),
        legend.title=element_text(size=15),legend.text=element_text(size=12),legend.key.size=unit(1,"cm"),
        legend.background=element_blank())

print(deletionPlot)

What about coverage over the genome? Are deletions obvious when looking at coverage?

ggplot(data=subset(gcNorm,subset=chr %in% c(3,5,6)), aes(x=start/10^6,y=JRIAL2A/B73)) +
  theme_bw()+
  ylim(0,3)+
  facet_grid( chr ~ . )+
  geom_point(size=0.3)+
  xlab("position (Mbp)") +
  ylab("coverage ratio (JRIAL2A/B73)") +
  stat_smooth(data=subset(gcNorm,subset=chr %in% c(3,5,6)), aes(x=start/10^6,y=JRIAL2A/B73, col = chr), span=0.025,method = "loess", size=1.7,colour="red")+
  theme(axis.title.x = element_text(size=a.size,vjust=-0.3),axis.title.y = element_text(size=a.size,vjust=1.6,hjust=0.75),
        axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-15),
        strip.text.y=element_text(size=25,angle=-90),
        legend.title=element_text(size=15),legend.text=element_text(size=12),legend.key.size=unit(1,"cm"),
        legend.background=element_blank())

ggplot(data=subset(gcNorm,subset=chr == 6 & start<25*10^6 & start>5*10^6), aes(x=start/10^6,y=JRIAL2B/B73)) +
  theme_bw()+
  ylim(-4,4)+
  #facet_grid( chr ~ . )+
  geom_point(size=0.8, colour="darkgreen")+
  geom_point(aes(x=start/10^6,y=td),size=0.8,colour="red4")+
  xlab("position (Mbp)") +
  ylab("coverage ratio (JRIAL2B/B73)") +
  stat_smooth(data=subset(gcNorm,subset=chr ==6 & start<25*10^6 & start>5*10^6), aes(x=start/10^6,y=JRIAL2B/B73, col = chr), span=0.08,method = "loess", size=1.7,colour="darkgreen")+
  stat_smooth(data=subset(gcNorm,subset=chr ==6 & start<25*10^6 & start>5*10^6), aes(x=start/10^6,y=td, col = chr), span=0.08,method = "loess", size=1.7,colour="red4")+
  theme(axis.title.x = element_text(size=a.size,vjust=-0.3),axis.title.y = element_text(size=a.size,vjust=1.6,hjust=0.75),
        axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-15),
        strip.text.y=element_text(size=25,angle=-90),
        legend.title=element_text(size=15),legend.text=element_text(size=12),legend.key.size=unit(1,"cm"),
        legend.background=element_blank())

### do the same for TIL01
ggplot(data=subset(gcNorm,subset=chr == 6 & start<25*10^6 & start>5*10^6), aes(x=start/10^6,y=TIL01/B73)) +
  theme_bw()+
  ylim(-4,4)+
  #facet_grid( chr ~ . )+
  geom_point(size=0.8, colour="darkgreen")+
  geom_point(aes(x=start/10^6,y=td),size=0.8,colour="red4")+
  xlab("position (Mbp)") +
  ylab("coverage ratio (TIL01/B73)") +
  stat_smooth(data=subset(gcNorm,subset=chr ==6 & start<25*10^6 & start>5*10^6), aes(x=start/10^6,y=TIL01/B73, col = chr), span=0.08,method = "loess", size=1.7,colour="darkgreen")+
  stat_smooth(data=subset(gcNorm,subset=chr ==6 & start<25*10^6 & start>5*10^6), aes(x=start/10^6,y=td, col = chr), span=0.08,method = "loess", size=1.7,colour="red4")+
  theme(axis.title.x = element_text(size=a.size,vjust=-0.3),axis.title.y = element_text(size=a.size,vjust=1.6,hjust=0.75),
        axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-15),
        strip.text.y=element_text(size=25,angle=-90),
        legend.title=element_text(size=15),legend.text=element_text(size=12),legend.key.size=unit(1,"cm"),
        legend.background=element_blank())

What annotations are segregating as CNVs?

# define some useful colors
colss<-c("cornflowerblue", "red3", "grey","darkolivegreen4","purple3","yellow3","orange","red4")

# plot some bar plots
x<-barplot((table(factor(subsetByOverlaps(gff,reduce(downGRanges),type="within" )$type, levels=unique(gff$type)))/table(factor(gff$type,levels=unique(gff$type)))),
            #names=unique(gff$type),
            col=alpha(colss,0.85), cex.axis=1.8, cex.names = 0.8, 
            main="Proportion of annotations scored as CNV",xaxt="n", ylab = "proportion CNV", cex.lab=1.8)
labs <- factor(unique(gff$type))
text(cex=1.0, x=x+0.3, y=-0.003, labs, xpd=TRUE, srt=45, pos=2)

x<-barplot(table(subsetByOverlaps(gff,reduce(downGRanges),type="within")$type),
           col=colss, cex.axis=1.8, cex.names = 1.2, 
           ylab="number of annotations called as CNV",cex.lab=1.8, xaxt="n")
labs <- names(table(gff$type))
text(cex=1.0, x=x+0.3, y=-400, labs, xpd=TRUE, srt=45, pos=2)

Genes and CNVs..

How are genes impacted by CNVs? Are certain annotation types more susceptible to tolerating CNVs, and what are the frequencies of CNVs over each annotation type?

# a list of annotation types, first clearing up the data
gff$type[gff$type == "miRNA_gene"]<-"miRNA"
gff$type[gff$type == "three_prime_UTR"]<-"3'-UTR"
gff$type[gff$type == "five_prime_UTR"]<-"5'-UTR"
annotation<-unique(gff$type)

# for each annotation type generate an SFS

for ( i in annotation[-7] ) {
  # subset the gff
  annot<-subset(gff,subset=type == i)
  # overlap these genes with the down CNVs
  annotDown<-subsetByOverlaps(downGRanges, annot)
  propVariant<-rev(table(factor(annotDown$freq, levels=0:38))/sum(table(factor(annotDown$freq,levels=0:38))))
  propVariant<-propVariant[-c(1,38)]
  b<-barplot(rbind(SNPsfs,propVariant[-38]), main=i,col=c("cornflowerblue","orange"),beside=T, xlab="frequency",
          ylab="proportion of variant sites", xaxt="n", cex.axis=1.4, cex.lab=1.4)
  #Axlabel<-c(1,5,10,15,20,25,30,35)
  #AxAt<-(b[1,]+b[2,])/2
  #AxAt<-AxAt[seq(1,length(AxAt),by=5)]
  legend("topright", legend=c("SNPs","down CNVs"), pch=15, col=c("cornflowerblue","orange"), cex=1.8, bty="n")
}# for

Deletions in maize1 vs maize2

How do CNVs segregate according to maize1 and maize2? Do either of the two subgenomes show any bias in the number of segregating CNVs? What about the collection of genes that do not belong to either of the two maize subgenomes (i.e. those without any syntenic orthologs between Sorghum and maize)?

downGenes<-subsetByOverlaps(downGRanges,gff, minoverlap = 500)
# attribute maize1 and maize2 designation to the down CNVs
sub.genome.hits<-as.matrix(findOverlaps(downGenes,gff))
values(downGenes)$sub.genome<-"NA"
values(downGenes)$sub.genome[sub.genome.hits[,1]]<-gff$sub.genome[sub.genome.hits[,2]]

table(downGenes$sub.genome) # total genes annotations seggregating as CNVs
## 
##     M1     M2 no syn 
##   1833   1067   6008
table(gff$sub.genome) # total annotations of each type
## 
##     M1     M2 no syn 
## 167907  97720 200110
table(downGenes$sub.genome)/table(gff$sub.genome)
## 
##         M1         M2     no syn 
## 0.01091676 0.01091895 0.03002349

Genome-wide Tajima’s D

ggplot(data=as.data.frame(gcNorm[gcNorm$chr %in% c(1,3,6)]),aes(y=td,x=start, colour = td)) +
  theme_bw()+
  facet_grid( chr ~ . )+
  geom_point(size=0.8)+
  geom_smooth(colour="darkgrey",size=2,span=0.001)+
  geom_hline(aes(yintercept=0), linetype="dashed", size=2,col = "white")+
  xlab("position (Mbp)") +
  ylab("Tajima's D") +
  scale_colour_gradientn(colours=rainbow(5)) +
  theme(axis.title.x = element_text(size=a.size,vjust=-0.3),axis.title.y = element_text(size=a.size,vjust=1.6),
        axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-10),
        strip.text.y=element_text(size=25,angle=0),
        legend.title=element_blank(), legend.text=element_text(size=22),legend.key.size=unit(1.1,"cm"),
        legend.position=c(0.9,0.34),legend.background=element_blank())

Tajima’s D and CNV frequency

Next it is important to examine Tajima’s D over CNVs of various frequencies.

####
# Do some plotting
####

ggplot(data=downGroup, aes(x=(38-freq)/38, y=td)) +
  theme_bw()+
  geom_smooth(colour="darkgreen", size=2)+
  ylab("Tajima's D")+
  xlab("CNV frequency")+
  theme(axis.title.x = element_text(size=a.size,vjust=-0.3),axis.title.y = element_text(size=a.size,vjust=1.6),
        axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-5),
        strip.text.y=element_text(size=25,angle=0),
        legend.background=element_blank())

But what about removing that “deletion” spot from chromosome 6, to exclude the possibility that this is the sole cause of the observed trend?

ggplot(data=downGroup[downGroup$chr != 6,], aes(x=(38-freq)/38, y=td)) +
  theme_bw()+
  geom_smooth(colour="darkgreen", size=2)+
  ylab("Tajima's D")+
  xlab("CNV frequency")+
  theme(axis.title.x = element_text(size=a.size,vjust=-0.3),axis.title.y = element_text(size=a.size,vjust=1.6),
        axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-5),
        strip.text.y=element_text(size=25,angle=0),
        legend.background=element_blank())

What about Tajima’s D next to CNV calls? Do we see any evidence of variable Tajima’s D in these regions?

Now look at the Tajima’s D data over genic regions:

# try the same plot but for genes
ggplot(data=as.data.frame(downGenes), aes(x=(38-freq)/38, y=td)) +
  theme_bw()+
  geom_smooth(colour="darkgreen", size=2)+
  ylab("Tajima's D")+
  xlab("CNV frequency")+
  theme(axis.title.x = element_text(size=a.size,vjust=-0.3),axis.title.y = element_text(size=a.size,vjust=1.6),
        axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-5),
        strip.text.y=element_text(size=25,angle=0),
        legend.background=element_blank())

Now use the maize1 and maize2 subgenome data to polarize deletions so that we can be sure they really are deletions in Palmar Chico.

# now plot maize1 and maize2
ggplot(data=as.data.frame(downGenes), aes(x=(38-freq)/38, y=td,fill=sub.genome)) +
  theme_bw()+
  facet_wrap(~sub.genome, nrow=3) +
  stat_smooth(color="grey34",size = 2, alpha = 0.6)+
  scale_fill_manual(values = c("blue", "red", "grey"))+
  ylab("Tajima's D")+
  xlab("CNV frequency")+
  theme(axis.title.x = element_text(size=a.size,vjust=-1.6),axis.title.y = element_blank(),
          axis.text.x = element_text(size=a.size),axis.text.y = element_text(size=a.size-5),
          plot.margin=unit(c(m.size, m.size, m.size, m.size), "in"),
          strip.text.x=element_text(size=25,angle=0),strip.background=element_blank(),
          legend.position="none")

Looking for patterns of selection across the genome

Next we can examine the data for signatures of selective sweeps within the Palmar Chico population. To do this we can use the program SweeD. SweeD is based on the algorithm presented in Nielsen et al. (2005, Genome Research) and first implemented in the program SweepFinder.

First, we take the mafs.gz file from ANGSD and use a perl script to convert the allele frequency data into a format suitable for SweeD and SweepFinder.

#!/usr/bin/perl
use strict;
use warnings;

# usage: script.pl <mafs.gz>

#####
# Convert a ANGSD mafs.gz file to SweepFinder format.
#####

#declare some useful variables
my $previousChr=0;

# first open the maf output from ANGSD

open( MAFS,"<$ARGV[0]" ) || die "Could not open file:$!\n";

# (a header line is printed each time a new per-chromosome output file is opened)

# scan through the input file
while ( <MAFS> ) {
  next if m/position/;
    chomp;
    # a variable to hold the frequency
    my $freq;
    # assume the SNP is folded
    my $fold=0;
    # split the incoming line;
    my ( $chromo, $position, $major, $minor, $ref, $anc, $knownEM, $nInd ) =
            split /\t/;
    
    # see if we need to open up a new file
    if ( $previousChr ne $chromo ) {
        close OUT;
        # open up the initial output file
        my $file = "$chromo"."_sweepfinder_input.txt";
        open( OUT, ">$file" );
        print OUT "position\tx\tn\tfolded\n";
    }# if
    $previousChr = $chromo;     
    # if we have a ancestral state declare the SNP to be unfolded.
    if ( $anc eq "N" ) { $fold=1; }
    # next if the frequency of the minor allele is zero.
    next if $knownEM == 0;
    # if the minor is equal to the anc, then we need to flip the frequency
    # because the major allele is now the derived.
    if ( $minor eq $anc ) {
        $freq = (2*$nInd)-($knownEM*($nInd*2));
    }# if
    else {
        $freq=$knownEM*(2*$nInd);
    }# else
    next if $freq == 0;
    print OUT $position , "\t"; 
    printf OUT ("%.0f",$freq);
    print OUT "\t";
    print OUT 2*$nInd , "\t" , $fold , "\n";
}#while

exit;

This outputs one file per chromosome and these files are then run as a batch script using:

#!/bin/bash -l
#SBATCH -D /group/jrigrp4/custom_cnv/sfs/windows
#SBATCH -o /group/jrigrp4/custom_cnv/logs/sfs_out_log-%j.txt
#SBATCH -e /group/jrigrp4/custom_cnv/logs/sfs_err_log-%j.txt
#SBATCH -J SweeD
#SBATCH --array=0-9
#SBATCH --mem-per-cpu=11000
#SBATCH --cpus-per-task=1

##Simon Renny-Byfield, UC Davis, July 2015

echo "Starting Job:"
date

files=( *_sweepfinder_input.txt )

cmd="SweeD -name ${files[$SLURM_ARRAY_TASK_ID]}_output -input ${files[$SLURM_ARRAY_TASK_ID]} -grid 1000" 
echo $cmd
eval $cmd

echo "Ending Job: "
date

Next we need to simulate data using ms so that we can estimate the distribution of the composite likelihood statistic under a neutral model (i.e. get some estimate of significance).

To do this we can invoke ms. We will simulate 38 haploid samples (the same as our population), 100 times. We need to provide the theta value -t, which is quoted as 4Nμ. Assuming an N of 150,000 and a mutation rate of 3 × 10^-8, the value of theta is 0.018 per base pair. We want to simulate over a 5 Mb region, so our theta is

9 × 10^4

The ms sims can be run as such:

 ms 38 100 -t 9000 -p 6 > test.ms
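The theta arithmetic above can be checked directly (using the N and mutation rate assumed in the text):

```python
# Quick check of the theta used for the ms simulations:
# theta = 4*N*mu per bp, scaled by the length of the simulated region.
N = 150_000                 # assumed effective population size (from the text)
mu = 3e-8                   # assumed per-bp mutation rate (from the text)

theta_bp = 4 * N * mu                 # 0.018 per base pair
theta_region = theta_bp * 5_000_000   # over a 5 Mb region: 9e4
```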

The Chromosome 6 “problem”

Now load in an example data set from the SweeD output. I’m picking chromosome six because of the apparently interesting region that seems to be a deletion in Palmar Chico.

SweeD<-fread("/Users/simonrenny-byfield/GitHubRepos/cnvwin/data/SweeD_Report.6_sweepfinder_input.txt_output")

# read in the ITS Blast data
blst<-read.table("/Users/simonrenny-byfield/GitHubRepos/cnvwin/data/chr6_rDNA.blst")

# filter the data
blst<-data.table(blst)
# remove any duplicate rows
blst<-(unique(blst))
blst<-subset(blst,subset=V2 > 98 & V3 < 25000000)
plot(SweeD$Position/10^6,SweeD$Likelihood, xlim=c(5,25), type = "n", ylab = "CLR statistic", xlab="position (Mbp)", cex.lab=1.5, cex.axis=1.5, ylim=c(0,100))
lines(SweeD$Position/10^6,SweeD$Likelihood, col = "red")
lines(start(subset(MB.windows,subset=seqnames==6))/10^6,subset(MB.windows,subset=seqnames==6)$propCNV*100, col = "blue")
#points(start(subset(gff,subset=seqnames==6))/10^6,y=rep(30,length(start(subset(gff,subset=seqnames==6))/10^6)), col = "green", pch = 1)
#points(start(subset(downGenes,subset=seqnames==6))/10^6,y=rep(30,length(start(subset(downGenes,subset=seqnames==6))/10^6)), col = "blue", pch = 1, cex=0.4)
#points(blst$V3/10^6, blst$V6/100)
par(new=T)
plot(density(blst$V3/10^6),xlim=c(5,25),axes=FALSE,main=NA,xlab=NA,ylab=NA, col="green")
legend("topright", legend=c("CLR score","density of 18S rDNA","% CNV calls"), col=c("red","green","blue"), pch = 15)